Abstract

In this paper, some definitions and concepts of stability are given, an
application of stability to regularization in Hilbert space is carried out with
some modification, and some illustrative examples and conclusions are listed.



1. Introduction 

It has long been known that when trying to estimate an unknown function 
from data, one needs to find a tradeoff between bias and variance. Indeed, on 
one hand, it is natural to use the largest model in order to be able to approximate 
any function, while on the other hand, if the model is too large, then the 
estimation of the best function in the model will be harder given a restricted 
amount of data. Several ideas have been proposed to fight against this 
phenomenon. One of them is to perform estimation in several models of 
increasing size and then to choose the best estimator based on a complexity 
penalty (e.g. Structural Risk Minimization). Another technique is the bagging
approach of Breiman (1996)[5], which consists of averaging several estimators
built from random subsamples of the data. In the early nineties, concentration
inequalities became popular in the probabilistic analysis of algorithms, due to
the work of McDiarmid (1989), and started to be used as tools to derive
generalization bounds for learning algorithms by Devroye (1991). Building on
this technique, Lugosi and Pawlak (1994) obtained new bounds for the k-NN,
kernel rules and histogram rules.



A key issue in the design of efficient machine learning systems is the
estimation of the accuracy of learning algorithms. Among the several
approaches that have been proposed for this problem, one of the most prominent
is based on the theory of uniform convergence of empirical quantities to their
mean. This theory provides ways to estimate the risk (or generalization error)
of a learning system based on an empirical measurement of its accuracy and a
measure of its complexity, such as the Vapnik-Chervonenkis (VC) dimension or
the fat-shattering dimension. We explore here a different approach, which is
based on sensitivity analysis. Sensitivity analysis aims at determining how
much a variation of the input can influence the output of a system. It has been
applied to many areas such as statistics and mathematical programming. In the
latter domain, it is often referred to as perturbation analysis.



Uniform stability may appear to be a strict condition. Actually, we will observe
that many existing learning methods exhibit a uniform stability which is
controlled by the regularization parameter and can thus be very small. Many
algorithms such as Support Vector Machines (SVM) or the classical regularization
networks introduced by Poggio and Girosi (1990)[9] perform the minimization
of a regularized objective function where the regularizer is a norm in a
reproducing kernel Hilbert space (RKHS):

$N(f) = \|f\|_k^2$, where $k$ refers to the kernel.



2. Some Definitions and Concepts 



Def.(1) Learning Algorithm [4]:- A learning algorithm is a function $A$ from
$Z^m$ into $F \subset Y^X$ which maps a learning set $S$ onto a function $A_S$
from $X$ to $Y$, where $X, Y \subset \mathbb{R}$ are the input and output spaces
respectively, and $S$ is a training set

\[ S = \{ z_1 = (x_1, y_1), \dots, z_m = (x_m, y_m) \} \]

of size $m$ in $Z = X \times Y$ drawn i.i.d. from an unknown distribution $D$.



Def.(2) Stability [3]:- Consider the system

\[ \dot{x}(t) = f[x(t), u(t)], \quad x \in X \subseteq \mathbb{R}^n,\ u \in U \subseteq \mathbb{R}^m. \]

A state $x$ is an equilibrium state if there exists $u$ such that $f(x, u) = 0$.



Def.(3) Hilbert Space [2]:- An inner-product space which is infinite dimensional
and complete is called a Hilbert space (real or complex, according to whether
the scalar field is real or complex).



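Def.(4) Hypothesis Stability [8]:- An algorithm $A$ has hypothesis stability
$\beta$ with respect to the loss function $\ell$ if the following holds:

\[ \forall i \in \{1, \dots, m\}, \quad E_{S,z}\big[\, |\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)| \,\big] \le \beta, \]

where $S^{\setminus i}$ denotes the training set $S$ with its $i$-th example
removed. Note that this is the $L_1$ norm with respect to $D$, so that we can
rewrite the above as

\[ E_S\big[\, \|\ell(A_S, \cdot) - \ell(A_{S^{\setminus i}}, \cdot)\|_1 \,\big] \le \beta. \]

Def.(5) Pointwise Hypothesis Stability:- An algorithm $A$ has pointwise
hypothesis stability $\beta$ with respect to the loss function $\ell$ if the
following holds:

\[ \forall i \in \{1, \dots, m\}, \quad E_S\big[\, |\ell(A_S, z_i) - \ell(A_{S^{\setminus i}}, z_i)| \,\big] \le \beta. \]

Another, weaker notion of stability was introduced by Kearns and Ron. It
consists of measuring the change in the expected error of the algorithm instead
of the average pointwise change.

Def.(6) Error Stability [3]:- An algorithm $A$ has error stability $\beta$ with
respect to the loss function $\ell$ if the following holds:

\[ \forall S \in Z^m,\ \forall i \in \{1, \dots, m\}, \quad \big| E_z[\ell(A_S, z)] - E_z[\ell(A_{S^{\setminus i}}, z)] \big| \le \beta, \]

which can also be written

\[ \forall S \in Z^m,\ \forall i \in \{1, \dots, m\}, \quad |R(S) - R^{\setminus i}(S)| \le \beta. \]

Def.(7) Uniform Stability:- An algorithm $A$ has uniform stability $\beta$ with
respect to the loss function $\ell$ if the following holds:

\[ \forall S \in Z^m,\ \forall i \in \{1, \dots, m\}, \quad \|\ell(A_S, \cdot) - \ell(A_{S^{\setminus i}}, \cdot)\|_\infty \le \beta. \]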



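Def.(8) Convex Set [2]:- A set $S$ in a vector space $X$ over the field $F$ is
called convex if

\[ \lambda x + (1 - \lambda) y \in S, \quad \forall x, y \in S,\ \forall\, 0 \le \lambda \le 1, \]

i.e.

\[ \lambda S + (1 - \lambda) S \subseteq S. \]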



Def.(9) A loss function [6]:- A loss function $\ell$ defined on $F \times Y$ is
$\sigma$-admissible with respect to $F$ if the associated cost function $c$ is
convex with respect to its first argument and the following condition holds:

\[ \forall y_1, y_2 \in D,\ \forall y' \in Y, \quad |c(y_1, y') - c(y_2, y')| \le \sigma |y_1 - y_2|, \]

where $D = \{ y : \exists f \in F, \exists x \in X, f(x) = y \}$ is the domain of
the first argument of $c$. Thus, in the case of the quadratic loss for example,
this condition is verified if $Y$ is bounded and $F$ is totally bounded, that
is, there exists $M < \infty$ such that

\[ \forall f \in F,\ \|f\|_\infty \le M \quad \text{and} \quad \forall y \in Y,\ |y| \le M, \]

with $F$ a convex subset of a linear space.
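The admissibility condition of Def.(9) can be probed numerically. The following is a minimal sketch, assuming a finite grid of values in $[-M, M]$ with $M = 2$ (the helper name `admissibility_constant` and the grid are illustrative choices, not part of the original text). It estimates the smallest $\sigma$ for which the condition holds on the grid, for the absolute cost and for the quadratic cost.

```python
import itertools


def admissibility_constant(cost, first_args, second_args):
    """Smallest sigma with |c(y1, y') - c(y2, y')| <= sigma * |y1 - y2| on the grids."""
    sigma = 0.0
    for y1, y2 in itertools.combinations(first_args, 2):
        for y_prime in second_args:
            ratio = abs(cost(y1, y_prime) - cost(y2, y_prime)) / abs(y1 - y2)
            sigma = max(sigma, ratio)
    return sigma


def absolute_cost(y, y_prime):
    return abs(y - y_prime)


def quadratic_cost(y, y_prime):
    return (y - y_prime) ** 2


M = 2.0  # assumed bound on |f(x)| and |y| (illustrative value)
grid = [-M + 0.1 * k for k in range(41)]  # distinct points covering [-M, M]

sigma_abs = admissibility_constant(absolute_cost, grid, grid)
sigma_quad = admissibility_constant(quadratic_cost, grid, grid)
print("absolute cost: ", sigma_abs)
print("quadratic cost:", sigma_quad)
```

On a bounded domain the quadratic cost satisfies $|c(y_1, y') - c(y_2, y')| = |y_1 - y_2|\,|y_1 + y_2 - 2y'| \le 4M |y_1 - y_2|$, so the estimate for it should stay below $4M$, while the absolute cost should give a value close to 1.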

Def.(10) Classification Stability [5]:- A real-valued classification algorithm
$A$ has classification stability $\beta$ if the following holds:

\[ \forall S \in Z^m,\ \forall i \in \{1, \dots, m\}, \quad \|A_S(\cdot) - A_{S^{\setminus i}}(\cdot)\|_\infty \le \beta. \]



3. Survey of Online Kernel Methods [6] 



The perceptron algorithm (Rosenblatt, 1958) is arguably one of the simplest
online learning algorithms. Given a set of labeled instances
$\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\} \subset X \times Y$, where
$X \subseteq \mathbb{R}^d$ and $y_i \in \{\pm 1\}$, the algorithm starts with an
initial weight vector $\theta = 0$. It then predicts the label of a new instance
$x$ to be $\hat{y} = \mathrm{sign}(\langle \theta, x \rangle)$. If $\hat{y}$
differs from the true label $y$, then the vector $\theta$ is updated as
$\theta = \theta + yx$. This is repeated until all points are well classified.
The following result bounds the number of mistakes made by the perceptron
algorithm (Freund and Schapire, 1999, Theorem 3). This generalizes the original
result for the case when the points are strictly separable, i.e., when there
exists a $\theta$ such that $\|\theta\| = 1$ and
$y_i \langle \theta, x_i \rangle \ge \gamma$ for all $(x_i, y_i)$.

The so-called kernel trick has recently gained popularity in machine learning.
As long as all operations of an algorithm can be expressed with inner products,
the kernel trick can be used to lift the algorithm to a higher-dimensional
feature space: the inner product in the feature space produced by the mapping
$\phi : X \to H$ is represented by a kernel
$k(x, x') = \langle \phi(x), \phi(x') \rangle$. We can now drop the condition
$X \subseteq \mathbb{R}^d$ but instead require that $H$ be a reproducing kernel
Hilbert space (RKHS).
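The way the kernel trick lifts the perceptron can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian kernel and a small XOR-style data set (both are choices made here, not taken from the survey). The weight vector is never formed explicitly: it is represented by mistake counts `alpha`, so that $\langle \theta, \phi(x) \rangle = \sum_i \alpha_i y_i k(x_i, x)$.

```python
import math


def gaussian_kernel(x, x_prime, width=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * width^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-squared_distance / (2.0 * width ** 2))


def decision_value(alpha, points, labels, kernel, x):
    """Kernel form of <theta, x>: sum_i alpha_i * y_i * k(x_i, x)."""
    return sum(a * y * kernel(p, x) for a, y, p in zip(alpha, labels, points) if a)


def predict(alpha, points, labels, kernel, x):
    return 1 if decision_value(alpha, points, labels, kernel, x) > 0 else -1


def train_kernel_perceptron(points, labels, kernel, max_epochs=100):
    """Repeat passes over the data until every point is classified correctly."""
    alpha = [0] * len(points)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            if predict(alpha, points, labels, kernel, x) != y:
                alpha[i] += 1  # kernelized version of theta = theta + y * x
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


# XOR-style labels: not linearly separable in the input space.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]
alpha = train_kernel_perceptron(points, labels, gaussian_kernel)
print(alpha)
```

Only inner products between feature vectors are needed, which is why replacing them by `kernel(p, x)` is enough to run the same update rule in the feature space.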
4. Some Theorems About an Application of Stability to Regularization and Illustration Examples



Many algorithms such as Support Vector Machines (SVM) or the classical
regularization networks introduced by [9] perform the minimization of a
regularized objective function where the regularizer is a norm in a reproducing
kernel Hilbert space (RKHS):

\[ N(f) = \|f\|_k^2, \]

where $k$ refers to the kernel. The fundamental property of a RKHS $F$ is the
so-called reproducing property, which writes

\[ \forall f \in F,\ \forall x \in X, \quad f(x) = \langle f, k(x, \cdot) \rangle. \tag{2} \]

In particular this gives, by the Cauchy-Schwarz inequality,

\[ \forall f \in F,\ \forall x \in X, \quad |f(x)| \le \|f\|_k \sqrt{k(x, x)}. \tag{3} \]
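The Cauchy-Schwarz bound on $|f(x)|$ can be checked numerically for functions of the form $f = \sum_i \alpha_i k(c_i, \cdot)$, whose RKHS norm is $\|f\|_k^2 = \sum_{i,j} \alpha_i \alpha_j k(c_i, c_j)$. The sketch below assumes a one-dimensional Gaussian kernel and randomly drawn centers and coefficients; these choices and the helper names are illustrative only.

```python
import math
import random


def gaussian_kernel(x, x_prime, width=1.0):
    return math.exp(-((x - x_prime) ** 2) / (2.0 * width ** 2))


def evaluate(alpha, centers, x):
    """f(x) = sum_i alpha_i * k(c_i, x), i.e. <f, k(x, .)> by the reproducing property."""
    return sum(a * gaussian_kernel(c, x) for a, c in zip(alpha, centers))


def rkhs_norm(alpha, centers):
    """||f||_k = sqrt(sum_ij alpha_i * alpha_j * k(c_i, c_j))."""
    total = sum(
        ai * aj * gaussian_kernel(ci, cj)
        for ai, ci in zip(alpha, centers)
        for aj, cj in zip(alpha, centers)
    )
    return math.sqrt(max(total, 0.0))


rng = random.Random(0)
centers = [rng.uniform(-3.0, 3.0) for _ in range(8)]
alpha = [rng.uniform(-1.0, 1.0) for _ in range(8)]
norm_f = rkhs_norm(alpha, centers)

# Smallest value of ||f||_k * sqrt(k(x, x)) - |f(x)| over a grid of test points.
grid = [-4.0 + 0.1 * j for j in range(81)]
worst_gap = min(
    norm_f * math.sqrt(gaussian_kernel(x, x)) - abs(evaluate(alpha, centers, x))
    for x in grid
)
print(worst_gap)
```

The gap should never be negative (up to rounding), which is the pointwise control of $f$ by its norm that the stability proofs rely on.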



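Theorem (1): [8]
Let $F$ be a reproducing kernel Hilbert space with kernel $k$ such that
$\forall x \in X,\ k(x, x) \le \kappa^2 < \infty$. Let $\ell$ be
$\sigma$-admissible with respect to $F$. The learning algorithm $A$ defined by

\[ A_S = \arg\min_{g \in F} \frac{1}{m} \sum_{i=1}^{m} \ell(g, z_i) + \lambda \|g\|_k^2 \]

has uniform stability $\beta$ with respect to $\ell$ with

\[ \beta \le \frac{\sigma^2 \kappa^2}{2 \lambda m}. \tag{4} \]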



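Proof

We have

\[ d_N(g, g') = \|g - g'\|_k^2. \]

Thus, the following lemma (let $\ell$ be $\sigma$-admissible with respect to
$F$, and $N$ a functional defined on $F$ such that for all training sets $S$,
$R_r$ and $R_r^{\setminus i}$ have a minimum (not necessarily unique) in $F$;
let $f$ denote a minimizer in $F$ of $R_r$, and for $i = 1, \dots, m$ let
$f^{\setminus i}$ denote a minimizer in $F$ of $R_r^{\setminus i}$; then for any
$t \in [0, 1]$,

\[ N(f) - N(f + t \Delta f) + N(f^{\setminus i}) - N(f^{\setminus i} - t \Delta f) \le \frac{t \sigma}{\lambda m} |\Delta f(x_i)|, \]

where $\Delta f = f^{\setminus i} - f$) gives

\[ 2 \|\Delta f\|_k^2 \le \frac{\sigma}{\lambda m} |\Delta f(x_i)|. \]

Indeed, for $N(f) = \|f\|_k^2$ the left-hand side of the lemma equals
$2t(1 - t) \|\Delta f\|_k^2$; dividing by $t$ and letting $t \to 0$ yields the
inequality above. Using the Cauchy-Schwarz bound (3), we get

\[ |\Delta f(x_i)| \le \|\Delta f\|_k \sqrt{k(x_i, x_i)} \le \kappa \|\Delta f\|_k, \]

so that

\[ \|\Delta f\|_k \le \frac{\sigma \kappa}{2 \lambda m}. \]

Now we have, by the $\sigma$-admissibility of $\ell$,

\[ |\ell(f, z) - \ell(f^{\setminus i}, z)| \le \sigma |f(x) - f^{\setminus i}(x)| = \sigma |\Delta f(x)|, \]

which, using (3) again, gives the result:
$\sigma |\Delta f(x)| \le \sigma \kappa \|\Delta f\|_k \le \sigma^2 \kappa^2 / (2 \lambda m)$.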



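Theorem (2): [8]

Let $A$ be the algorithm of Theorem (1), where $\ell$ is a loss function
associated to a convex cost function $c(\cdot, \cdot)$. We denote by $B(\cdot)$
a positive non-decreasing real-valued function such that for all $y \in D$,

\[ \forall y' \in Y, \quad c(y, y') \le B(y). \]

For any training set $S$, we have, for $f = A_S$,

\[ \|f\|_k^2 \le \frac{B(0)}{\lambda}, \]

and also

\[ \forall z \in Z, \quad 0 \le \ell(A_S, z) \le B\!\left( \kappa \sqrt{\frac{B(0)}{\lambda}} \right). \]

Moreover, $\ell$ is $\sigma$-admissible, where $\sigma$ can be taken as

\[ \sigma = \sup_{y' \in Y} \ \sup_{|y| \le B\left( \kappa \sqrt{B(0)/\lambda} \right)} \left| \frac{\partial c}{\partial y}(y, y') \right|. \tag{5} \]

Proof

We have for $f = A_S$,

\[ R_r(f) \le R_r(0) = \frac{1}{m} \sum_{i=1}^{m} \ell(0, z_i) \le B(0), \]

and also $R_r(f) \ge \lambda \|f\|_k^2$, which gives the first inequality. The
second inequality follows from (3). The last one is a consequence of the
definition of $\sigma$-admissibility.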



Theorem (3): [1]

Let $\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ be a sequence of labeled
examples with $\|x_i\| \le R$. Let $\theta$ be any vector with $\|\theta\| = 1$
and let $\gamma > 0$. Define the deviation of each example as

\[ d_i = \max(0, \gamma - y_i \langle \theta, x_i \rangle), \]

and let $D = \sqrt{\sum_i d_i^2}$. Then the number of mistakes of the perceptron
algorithm on this sequence is bounded by

\[ \left( \frac{R + D}{\gamma} \right)^2. \]
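The quantities $R$, $D$ and $\gamma$ of Theorem (3) can be computed directly and compared with the number of mistakes made in one online pass. The sketch below is an illustration under assumed data: 200 random points in $[-1, 1]^2$ labeled by the sign of the first coordinate with about 10% of the labels flipped, and the reference vector $\theta = (1, 0)$ with $\gamma = 0.5$ (all chosen here for illustration).

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def run_perceptron(examples):
    """One online pass; returns the final weight vector and the mistake count."""
    theta = [0.0] * len(examples[0][0])
    mistakes = 0
    for x, y in examples:
        prediction = 1 if dot(theta, x) > 0 else -1
        if prediction != y:
            theta = [t + y * xi for t, xi in zip(theta, x)]
            mistakes += 1
    return theta, mistakes


def mistake_bound(examples, theta, gamma):
    """((R + D) / gamma)^2 for a unit vector theta and a margin gamma > 0."""
    R = max(math.sqrt(dot(x, x)) for x, _ in examples)
    deviations = [max(0.0, gamma - y * dot(theta, x)) for x, y in examples]
    D = math.sqrt(sum(d * d for d in deviations))
    return ((R + D) / gamma) ** 2


rng = random.Random(1)
examples = []
for _ in range(200):
    x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    y = 1 if x[0] > 0 else -1
    if rng.random() < 0.1:  # label noise, so the data are not separable
        y = -y
    examples.append((x, y))

_, mistakes = run_perceptron(examples)
bound = mistake_bound(examples, theta=(1.0, 0.0), gamma=0.5)
print(mistakes, bound)
```

Because the theorem holds for any unit vector $\theta$ and any $\gamma > 0$, different reference choices give different (valid) bounds; the deviations $d_i$ absorb the examples that violate the margin.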



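Theorem (4): [8]

Let $A$ be an algorithm with uniform stability $\beta$ with respect to a loss
function $\ell$ such that $0 \le \ell(A_S, z) \le M$ for all $z \in Z$ and all
sets $S$. Then, for any $m \ge 1$ and any $\delta \in (0, 1)$, the following
bounds hold with probability at least $1 - \delta$ over the random draw of the
sample $S$:

\[ R \le R_{emp} + 2\beta + (4 m \beta + M) \sqrt{\frac{\ln(1/\delta)}{2m}} \tag{6} \]

and

\[ R \le R_{loo} + \beta + (4 m \beta + M) \sqrt{\frac{\ln(1/\delta)}{2m}}, \tag{7} \]

where $R$ is the risk, $R_{emp}$ the empirical error and $R_{loo}$ the
leave-one-out error.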
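The two bounds of Theorem (4) are simple to evaluate once $\beta$, $M$, $m$ and $\delta$ are known. The following is a minimal sketch; the function names and the numerical values used in the example (an empirical error of 0.1, $M = 1$, $\delta = 0.05$ and a stability of the form $\beta = 1/m$) are assumptions made for illustration.

```python
import math


def empirical_error_bound(r_emp, beta, M, m, delta):
    """Right-hand side of (6): R_emp + 2*beta + (4*m*beta + M) * sqrt(ln(1/delta) / (2*m))."""
    return r_emp + 2.0 * beta + (4.0 * m * beta + M) * math.sqrt(math.log(1.0 / delta) / (2.0 * m))


def leave_one_out_bound(r_loo, beta, M, m, delta):
    """Right-hand side of (7): R_loo + beta + (4*m*beta + M) * sqrt(ln(1/delta) / (2*m))."""
    return r_loo + beta + (4.0 * m * beta + M) * math.sqrt(math.log(1.0 / delta) / (2.0 * m))


# With beta proportional to 1/m, the term 4*m*beta stays constant and the
# bound tightens at the rate 1/sqrt(m).
for m in (100, 1000, 10000):
    print(m, empirical_error_bound(r_emp=0.1, beta=1.0 / m, M=1.0, m=m, delta=0.05))
```

The example makes visible why a stability of order $1/m$ matters: if $\beta$ decayed more slowly than $1/\sqrt{m}$, the factor $4 m \beta$ would prevent the bound from shrinking as $m$ grows.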



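Example (1): [8]
Stability of bounded SVM regression

Assume $k$ is a bounded kernel, that is $k(x, x) \le \kappa^2$, and $Y = [0, B]$.
Consider the loss function

\[ \ell(f, z) = |f(x) - y|_\varepsilon = \begin{cases} 0 & \text{if } |f(x) - y| \le \varepsilon, \\ |f(x) - y| - \varepsilon & \text{otherwise.} \end{cases} \]

This function is 1-admissible and we can take $B(y) = B$. The SVM algorithm for
regression with a kernel $k$ can be defined as

\[ A_S = \arg\min_{g \in F} \frac{1}{m} \sum_{i=1}^{m} \ell(g, z_i) + \lambda \|g\|_k^2, \]

and we thus get the following stability bound:

\[ \beta \le \frac{\sigma^2 \kappa^2}{2 \lambda m} = \frac{\kappa^2}{2 \lambda m}. \]

Moreover, by Theorem (2) we have

\[ \forall z \in Z, \quad 0 \le \ell(A_S, z) \le \kappa \sqrt{\frac{B}{\lambda}}. \]

Plugging the above into Theorem (4) gives the following bound:

\[ R \le R_{emp} + \frac{\kappa^2}{\lambda m} + \left( \frac{2 \kappa^2}{\lambda} + \kappa \sqrt{\frac{B}{\lambda}} \right) \sqrt{\frac{\ln(1/\delta)}{2m}}, \tag{8} \]

where $R_{emp}$ is the simplest estimator (the empirical error).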
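The ingredients of this example translate directly into code. The sketch below implements the $\varepsilon$-insensitive loss and evaluates the bound of Example (1) by plugging $\beta = \kappa^2 / (2 \lambda m)$ and $M = \kappa \sqrt{B/\lambda}$ into the empirical-error bound of Theorem (4). Function names and numerical values are illustrative assumptions.

```python
import math


def eps_insensitive_loss(prediction, target, eps):
    """|f(x) - y|_eps: zero inside the tube of half-width eps, linear outside."""
    return max(0.0, abs(prediction - target) - eps)


def svm_regression_bound(r_emp, kappa, B, lam, m, delta):
    """Bound on the risk of bounded SVM regression, obtained from Theorem (4)."""
    beta = kappa ** 2 / (2.0 * lam * m)   # uniform stability (sigma = 1)
    M = kappa * math.sqrt(B / lam)        # upper bound on the loss
    return r_emp + 2.0 * beta + (4.0 * m * beta + M) * math.sqrt(math.log(1.0 / delta) / (2.0 * m))


print(eps_insensitive_loss(1.0, 1.05, eps=0.1))  # inside the tube
print(eps_insensitive_loss(2.0, 1.0, eps=0.1))   # outside the tube
print(svm_regression_bound(r_emp=0.05, kappa=1.0, B=1.0, lam=0.1, m=5000, delta=0.05))
```

The regularization parameter $\lambda$ appears in the denominators, so a larger $\lambda$ tightens the stability terms at the cost of a possibly larger empirical error.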



Example (2): [11]

Absolute stability of a control system

Suppose the system of differential equations

\[ \dot{x} = Ax \tag{9} \]

represents the free motion of the body, where $A = (a_{ij})$ is a constant,
stable and non-singular matrix of order $n \times n$. Let $x$ be a (column)
vector in $\mathbb{R}^n$ and let the control system be described by the
following equations:

\[ \dot{x} = Ax - ub, \qquad \dot{u} = f(z), \qquad z = c^T x - \rho u, \tag{10} \]

where $c$, $b$ are constant vectors in $\mathbb{R}^n$, $c^T$ is the transpose of
the vector $c$, $\rho$ and $u$ are elements of $\mathbb{R}$, and $f$ is an
admissible characteristic function (see the Remark). System (10) is absolutely
stable if every solution $(x(t), u(t))$ of system (10) satisfies
$(x(t), u(t)) \to (0, 0)$ as $t \to +\infty$. The control variable of system
(10) is $u$. Let us take the transformation

\[ y = \dot{x}, \qquad z = c^T x - \rho u, \]

which transforms system (10) into the following new form:

\[ \dot{y} = Ay - b f(z), \qquad \dot{z} = c^T y - \rho f(z). \tag{11} \]

To preserve the stability properties, the transformation must be non-singular,
that is

\[ \begin{vmatrix} A & -b \\ c^T & -\rho \end{vmatrix} \ne 0, \quad \text{and hence} \quad \rho \ne c^T A^{-1} b. \]

So the origin is the unique critical point of system (11).
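The non-singularity condition can be verified numerically: the block matrix of the transformation $(x, u) \mapsto (y, z)$ has determinant $\det(A)\,(c^T A^{-1} b - \rho)$, so it vanishes exactly when $\rho = c^T A^{-1} b$. The sketch below uses a small hand-picked example with $n = 2$; the matrix $A$, the vectors $b$, $c$ and the helper functions are assumptions made for illustration.

```python
def det(matrix):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def solve(A, b):
    """Solve A x = b by Cramer's rule (A small and non-singular)."""
    d = det(A)
    solution = []
    for j in range(len(A)):
        replaced = [row[:j] + [b[i]] + row[j + 1:] for i, row in enumerate(A)]
        solution.append(det(replaced) / d)
    return solution


def transformation_matrix(A, b, c, rho):
    """Block matrix [[A, -b], [c^T, -rho]] mapping (x, u) to (y, z)."""
    top = [list(A[i]) + [-b[i]] for i in range(len(A))]
    return top + [list(c) + [-rho]]


A = [[-1.0, 0.0], [1.0, -2.0]]  # stable: eigenvalues -1 and -2
b = [1.0, 0.5]
c = [2.0, 1.0]

critical_rho = sum(ci * xi for ci, xi in zip(c, solve(A, b)))  # c^T A^{-1} b
print(critical_rho)
print(det(transformation_matrix(A, b, c, critical_rho)))  # singular case
print(det(transformation_matrix(A, b, c, 1.0)))           # non-singular case
```

For any other value of $\rho$ the determinant is non-zero, so the transformed system (11) is equivalent to system (10).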



Example (3): [10]

Stability of regularized least squares regression

We will consider the bounded case $Y = [0, B]$. The regularized least squares
regression algorithm is defined by

\[ A_S = \arg\min_{g \in F} \frac{1}{m} \sum_{i=1}^{m} \ell(g, z_i) + \lambda \|g\|_k^2, \]

where $\ell(f, z) = (f(x) - y)^2$.

We can take $B(y) = B^2$, so that $\ell$ is $2B$-admissible by Theorem (2). Also
we have

\[ \forall z \in Z, \quad 0 \le \ell(A_S, z) \le \kappa \sqrt{\frac{B}{\lambda}}. \]

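The stability bound for this algorithm is thus

\[ \beta \le \frac{2 \kappa^2 B^2}{\lambda m}, \]

so that we have the generalization error bound

\[ R \le R_{emp} + \frac{4 \kappa^2 B^2}{\lambda m} + \left( \frac{8 \kappa^2 B^2}{\lambda} + 2B \right) \sqrt{\frac{\ln(1/\delta)}{2m}}. \tag{12} \]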
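The effect of removing one training example can be observed on a toy instance of regularized least squares. The sketch below is an illustration under simplifying assumptions that are not in the original text: a one-dimensional linear kernel $k(x, x') = x x'$ with inputs in $[-1, 1]$ (so $\kappa = 1$ and $\|f\|_k = |w|$ for $f(x) = wx$), synthetic data, and a leave-one-out objective that keeps the factor $1/m$. The observed change in loss is computed next to the value $2 \kappa^2 B^2 / (\lambda m)$ for comparison.

```python
import random


def fit_rls(xs, ys, lam, m):
    """Minimizer w of (1/m) * sum_i (w*x_i - y_i)^2 + lam * w^2 (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * m)


def squared_loss(w, x, y):
    return (w * x - y) ** 2


def empirical_stability(xs, ys, lam, test_points):
    """Largest observed change in loss when a single training example is removed."""
    m = len(xs)
    w_full = fit_rls(xs, ys, lam, m)
    worst = 0.0
    for i in range(m):
        # Assumption: the reduced objective keeps the 1/m normalization.
        w_loo = fit_rls(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam, m)
        for x, y in test_points:
            worst = max(worst, abs(squared_loss(w_full, x, y) - squared_loss(w_loo, x, y)))
    return worst


def stability_bound(kappa, B, lam, m):
    return 2.0 * kappa ** 2 * B ** 2 / (lam * m)


rng = random.Random(0)
B, lam, m = 1.0, 0.5, 50
xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
ys = [min(B, max(0.0, 0.5 + 0.3 * x + rng.gauss(0.0, 0.05))) for x in xs]  # y in [0, B]
test_points = [(-1.0 + 0.25 * a, 0.25 * b) for a in range(9) for b in range(5)]

observed = empirical_stability(xs, ys, lam, test_points)
bound = stability_bound(kappa=1.0, B=B, lam=lam, m=m)
print(observed, bound)
```

Only a finite grid of test points is scanned, so `observed` is a lower estimate of the true uniform stability; repeating the experiment with larger $m$ or $\lambda$ shows how both quantities shrink.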



Remark

The function $f$ is said to be an admissible characteristic function if

1. $f$ is defined and continuous on $(-\infty, \infty)$;

2. $f(0) = 0$ and $x f(x) > 0$ for $x \ne 0$;

3. $\int_0^{\infty} f(s)\,ds$ and $\int_{-\infty}^{0} f(s)\,ds$ exist.



5. Conclusions 

There are many ways to define and quantify the stability of a learning
algorithm. The natural way of making such a definition is to start from the
goal: we want to get bounds on the generalization error of specific
learning algorithms, and we want these bounds to be tight when
the algorithm satisfies the stability criterion. Algorithms such as support
vector machines (SVM) or classical regularization networks perform the
minimization of a regularized objective function where the regularizer is a
norm in a reproducing kernel Hilbert space. We introduced definitions and
examples of types of stability and some concepts that we needed. The
results about the uniform stability of RKHS learning are given.